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Abstract. RNA viruses provide prominent examples of measurably evolving populations. 
In HIV infection, the development of drug resistance is of particular interest, because precise 
predictions of the outcome of this evolutionary process are a prerequisite for the rational 
design of antiretroviral treatment protocols. We present a mutagenetic tree hidden Markov 
model for the analysis of longitudinal clonal sequence data. Using HIV mutation data from 
clinical trials, we estimate the order and rate of occurrence of seven amino acid changes that 
are associated with resistance to the reverse transcriptase inhibitor efavirenz. 



1. Introduction 

In the developed world, patients infected with human immunodeficiency virus (HIV) are 
treated with combinations of several antiretroviral drugs in order to maximally suppress virus 
replication. However, the development of drug resistant virus variants limits the success of 
this intervention (Clavel and Hance, 2004). The genetic basis for the emergence of drug 
resistance is the high mutation rate in HIV, which is a result of the absence of a proof-reading 
mechanism for reverse transcription of RNA into DNA. The frequent mutations may alter the 
genetic composition of drug targets and create drug resistant mutants that will outcompete 
the wild type virus and eventually dominate the intra-host virus population. The applied 
drug cocktail then becomes ineffective and the patient experiences an increase in virus load. 

The accumulation of resistance-conferring mutations is a stochastic process that typically 
follows one of several co-existing evolutionary pathways (Boucher et al., 1992; Molla et al., 
1996; Shafer and Schapiro, 2005). Understanding the pathways to drug resistance is important 
for the design of effective drug combinations. For the study of these pathways, two types of 
DNA sequence data, based on either population or clonal sequencing, are available. 

In routine clinical diagnostics, viral drug targets are sequenced after therapy failure and 
prior to a therapy switch. Typically, population sequencing is applied, which results in one 
DNA sequence that represents the mixture of viruses in the population. This approach detects 
only those mutations that are present in at least 20% of the strains. Furthermore, the phase 
information, i.e., the knowledge whether polymorphisms at different sites actually co-occur 
on the same genome, is lost in population sequencing. Abundant in public databases (Rhee 
et al., 2003), cross-sectional data obtained from population sequencing have been used to 
demonstrate correlations between the occurrences of different mutations (Gonzales et al., 
2003; Sing et al., 2005). Mutagenetic tree models, a subclass of Bayesian networks based on 
directed trees (Beerenwinkel et al., 2005b), take such correlation analysis a step further by 
providing a tool particularly suited to infer evolutionary pathways to drug resistance from 
cross-sectional data (Beerenwinkel et al., 2005a). 

A more elaborate alternative to population sequencing is clonal sequencing. In this ap- 
proach, multiple viral genomes are independently amplified by PGR and sequenced. This 
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strategy has two major advantages over population scqTicncing. First, since mutations may 
occur in PGR reactions, the use of multiple independent PGR reactions reduces the impact of 
early and erroneous introduction of mutations at this step of the analysis. Second, in clonal 
sequencing, the genotype of a single strain is determined. Prom the resulting haplotypes the 
linkage between mutations is readily assessed. 

While haplotypes provide exact information on the correlation between mutations, we can 
gain additional insight about the order in which substitutions are fixed into the population 
from longitudinal data obtained by sequencing viruses from the same patient at multiple time 
points. In this paper, we propose an extension of the mutagenetic tree model to a model for 
longitudinal clonal HIV sequence data. The new model captures both the time structure per 
patient and the clonal variation per time point, and thus allows to employ time structure in 
the estimation of substitution rates of resistance-conferring mutations. Knowledge of these 
substitution rates, which can be interpreted as waiting times, provides the basis for rational 
therapy planning. Indeed, using viral evolutionary information encoded in mutagenetic trees 
has recently been shown to significantly improve predictions of in vivo virological response to 
antiretroviral combination therapies (Beerenwinkel et al., 2005d). 

Rather than modeling viral evolution in full generality, we intend to describe a specific 
phase of evolution in terms of a specific set of genetic events. More precisely, we model amino 
acid changes that confer drug resistance under the constant selective pressure of a given drug. 
The data analyzed in this paper comprise a total of 3350 clones that have been collected 
from 163 patients at different time points during three phase II clinical studies of the reverse 
transcriptase (RT) inhibitor efavirenz (Bacheler et al., 2001, 2000). We focus on seven amino 
acid substitutions in the HIV-1 RT that are associated with resistance to efavirenz (cf. Table 1, 
where each observed clone is coded as a binary vector of length seven). Working on protease 
sequences obtained from the same efavirenz studies, Foulkes and DeGruttola (2003) have 
modeled evolutionary pathways to drug resistance by grouping the observed clones into five 
clusters and estimating transition rates between clusters. In that approach, however, neither 
the dependence structure between single mutations nor their rates of occurence is modelled 
explicitly. In addition, clonal variation is not incorporated explicitely. 

Under treatment with efavirenz, virus strains harboring one or more of the seven consid- 
ered resistance mutations have a significant fitness advantage. As a result, these mutations 
accumulate in the population during the course of therapy (Boucher et al., 1992; Crandall 
et al., 1999; Molla et al., 1996). When modelling this accumulative evolutionary process we 
make the following three assumptions: 

(Al) Substitutions do not occur independently. There are preferred evolutionary pathways 
in which mutations are fixed. 

(A2) The fixation of mutations into the populations is definite, i.e., substitutions are non- 
reversible. 

(A3) At each time point, the virus population is dominated by a single strain and clones 
are independent and (possibly erroneous) copies of this genotype. 

As mentioned above, previous studies have provided overwhelming evidence for strong 
dependencies between resistance mutations, as stated in assumption (Al). In fact, the fixation 
of mutations typically exhibits order constraints that reflect properties of the underlying 
fitness landscape (Beerenwinkel et al., 2006). For example, if two advantageous mutations 
show positive epistasis, then they interact synergistically and the double mutation will be fixed 
more often than expected from the frequencies of the individual substitutions (Michalakis and 
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Roze, 2004). In other words, the mutations are more hkely to occur on the same mutational 
pathway. On the other hand, if advantageous mutations exhibit negative epistasis, then they 
interact antagonistically, resulting in mutants that are less fit than expected or even not viable 
at all. The mutagenetic tree model accounts for lethal mutational patterns by excluding some 
binary vectors from its state space, which thus encodes only viable mutant types. Mutational 
patterns in the state space arise from mutation accumulation along one of the pathways 
specified in the tree and we call these patterns compatible with the tree. 

Assumption (A2) refers to the non-reversibility of substitutions. It is motivated by the 
strong selective advantage that each resistance mutation may confer. Due to this strong selec- 
tive advantage, we interpret the fixation of each resistance mutation as a selective sweep, after 
which virtually all viable viruses exhibit the mutation. Together with assumption (Al), this 
implies that resistance mutations accumulate along different pathways by successive sweeps 
and that both back substitutions and incompatible states result in viruses whose relative 
fitness is so low that they can be ignored. 

The third assumption (A3) ties in with our point of view of a series of selective sweeps. In- 
deed, immediately after such an incident the population will consist of the new advantageous 
mutant type and its recent descendants, which are therefore phylogenetically independent. 
Under this assumption a virus population should exhibit very low genetic diversity. In light 
of the phylogenetic independence, the aim of our work is different from that of work em- 
ploying population genetic or phylogenetic methods (Drummond et al., 2002, 2003). Those 
approaches explicitely model the mutation and selection process and typically aim at inferring 
global population parameters, such as the global rate of adaption (Williamson, 2003) or the 
effective population size (Seo et al., 2002). By contrast, we model the joint probability distri- 
bution resulting from this evolutionary process directly, with the goal of inferring individual 
substitution rates, as the presence of specific mutational patterns determines the therapeutic 
options of each patient. 

The three assumptions discussed above lead us to a hidden Markov model (HMM) whose 
state space consists of the compatible states of a given mutagenetic tree. We consider unob- 
servable population states and assume a simple error process involving only a false positive 
and a false negative rate that generates clones from this hidden state. The fixation of each 
mutation is modeled by an independent Poisson process. Thus, at each time point, we assume 
the virus population to be dominated by a single mutational pattern and all clonal variation 
of this unobserved state is modelled as resulting from erronous copying of this strain. 

The remaining part of this paper is organized as follows. In Section 2 wc present the 
data on the seven considered mutations associated with resistance to efavirenz and perform 
simple permutation tests to assess the validity of assumptions (A2) and (A3). In Section 3, 
we develop the above outlined mutagenetic tree HMM (mtrcc-HMM). We present the results 
from fitting the mtrcc-HMM to the efavirenz data in Section 4 and conclude in Section 5 
where we discuss some of the limitations of our approach. 



2. Longitudinal clonal HIV sequence data and model assumption 

2.1. HIV data. We subsequently analyze a set of longitudinal clonal HIV sequences that have 

been collected during three clinical studies of efavirenz combination therapy (DMP 266-003, 
-004, -005). This data set is publicly available at the Stanford HIV drug resistance database 
(Rhee et al., 2003). Selection of samples, RNA amplification, cloning, and sequencing have 



4 



N. BEERENWINKEL AND M. DRTON 



been described in detail by Bachclcr ct al. (2000) and Bacheler et al. (2001). All patients 
received the non- nucleoside RT inhibitor efavirenz. 

Bacheler et al. (2000) identified a list of amino acid changes in the HIV RT associated with 
resistance to efavirenz. In particular, they described two alternative pathways to efavirenz re- 
sistance, one initiated by mutation K103N (103-pathway), the other by Y188L (188-pathway). 
In the present data set, for each patient, at most one of these two pathways occurs. We focus 
here on the 103-pathway, because mutations associated with the 188-pathway were found in 
only 7 out of 170 patients, which we excluded from further consideration. A total of 3350 
clones have been derived from the remaining 163 patients at between 1 and 11 different time 
points (median = 3, IQR 1-5). From the list of efavirenz resistance mutations associated with 
the 103-pathway we selected those that occur in at least 2% of clones. This strategy resulted 
in the seven mutations 103N (frequency 48.1%), 225H (8.0%), 1081 (4.9%), lOlQ (2.7%), 190S 
(2.7%), lOlE (2.5%), and 1001 (2.3%). 

As an example for mutational patterns found in the data set. Table 1 shows the clones 
observed in patient 22, which is also highlighted in Bacheler et al. (2000, Tab. 2) (Viterbi 
paths such as the one given in Table 1 are discussed in Section 4). This patient is atypical 
in that before the onset of therapy (week 0), two clones had mutation 103N. In fact, among 
the 1144 baseline clones only very few already featured the mutations considered here, i.e., 
mutations 103N (frequency 1.9%), 190S (1.3%), lOlQ (0.7%), lOlE (0.35%), and 1081 (0.08%). 
Mutations 1001 and 225H were not present in any baseline clone. Only six clones (0.52%) 
harbored a double mutation, namely 103N+190S. 

According to Table 1, the first selective sweep, introducing mutation 103N in the virus 
population of patient 22, has occurred by week 48. By week 70, mutation 225H shows definite 
signs of manifestation. This behaviour is in support of the assumptions (A2) and (A3) put 
forward in the Introduction. For a quantitative analysis of these assumptions, we perform 
randomization tests. 

2.2. Statistical tests. Let = 163 be the number of patients in our data set, and let Kij 
be the number of clones observed in patient i G [N] := {1, . . . ,N} at the j-th time point Uj. 
Moreover, let yijkm G {0, 1} be the indicator of the presence of mutation m G [M] := {1, . . . , 7} 
in clone k G [Kij] derived from patient i £ [N] at time point tij. We use two test statistics 
for evaluating the assumptions (A2) and (A3), respectively. In both cases, we treat different 
patients as independent observations. 

The non-reversibility of substitutions (A2) is tested by tracing the change in mutant allele 
frequency for each patient over time. This frequency is 



For each mutation m, our test statistic counts how often its frequency decreases from one 
time point to the next. Thus, 



where / denotes the indicator function. We estimate the distribution of Am under the null 
hypothesis of equal probability of increase and decrease of allele frequencies by randomizing 
the temporal order of the clone populations. Under non-reversibility of substitutions, the 
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Table 1. Mutation data and inferred Viterbi path for patient 22. Presence 
and absence of mutations are encoded as "1" and "0" , respectively. At each 
time point (week) the first row represents the inferred population state (in 
italics), all other rows are unordered and correspond to observed clones. 



observed value of the test statistic Am should be small compared to the randomization-based 
values of Am- 
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Figure 1. Non-reversibility of substitutions. For each mutation, the loga- 
rithm of the number of times for which its frequency in the population strictly 
decreases between successive time points is displayed (bold discs). The distri- 
bution of this test statistic was estimated by shuffling 1000 times the order of 
time points (box plots). 

In our data, mutation losses occur in only 1.5 to 9.4% of the cases (Figure 1). In other 
words, for all seven considered mutations the frequency of the mutant increases or stays 
constant in over 90% of pairs of subsequent virus populations. For most mutations, this 
value was significantly smaller than expected (piooi = 0.011, pioiE = 0.013, piosN < 0.001, 
P108I < 0.001, P225H < 0.001) with the exceptions of mutations lOlQ (pioiQ = 0.056) and 
190S (pi9os = 0.465). We conclude that assumption (A2) of non-reversible substitutions is 
reasonable for the majority of the data. 

For assessing the validity of assumption (A3), we measure the genetic diversity among 
mutational patterns by the Hamming distance. If c and c' are two clones, their Hamming 
distance is defined as 

Dh{c,c')= I{Cm^c'^}. 

me[M] 

The diversity of a population of clones ci, . . . ,ck is the expected Hamming distance between 
two clones drawn at random from that population, 

, K(K - 1) 

Dh{Ci, . . . ,Ck) = 2^ DH{Ck,Ck'). 

k<k' 

Our test statistic is the expected population diversity 

1 1 

i(z[N] * j=l 

We consider the null hypothesis of observing the full diversity present in all clones from one 
patient at each single time point. The distribution of B under this null is estimated by 
randomly permuting the assignment of clones to time points. Observed values at the left 
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Figure 2. Genetic diversity. The logarithm of the average Hamming dis- 
tance between clones from the same time point is displayed (bold disc). The 
distribution of this test statistic was estimated by randomizing 1000 times the 
assignment of clones to time points (histogram). 

tail of this distribution indicate less genetic diversity per time point than expected from the 
total diversity present in the clones. Large population diversities would put into question our 
assumption of a single population state. 

For our data, grouping clones according to their sampling time reduces the genetic diversity 
more than expected from a random grouping. We find that on average clones from the same 
population differ in only 1 out of 1900 of the considered mutations (Figure 2). For random 
clone groupings the expected value is 1 in 300. This difference is highly significant (p < 0.001). 
Thus, most genetic diversity occurs between time points rather than within time points, and 
most of the observed populations indeed look like having gone through a selective sweep. We 
therefore conclude that our assumptions described in the Introduction and formalized in the 
mutagenetic tree HMM, which is described next, are in reasonable agreement with the clonal 
HIV mutation data considered here. 

3. Mutagenetic tree hidden Markov model 

We begin this section by recalling the basic definition and properties of the mutagenetic 
tree model for cross-sectional data. We first extend this basic model to a mutagenetic tree 
Markov chain model for multiple observations per patient in time. Second, we introduce the 
mutagenetic tree hidden Markov model (mtree-HMM), which assumes the evolutionary state 
of the virus population to be unobservable. An EM algorithm for parameter estimation is 
presented, the details of which are given in the Appendix. 

3.1. Mutagenetic tree model. Consider M genetic events of interest, which we also call 
mutations. We treat these events as random and use binary random variables to indicate 
their occurrence. For each mutation m G [M] := {1, . . . ,M}, let be a binary random 
variable taking values in {0, 1} such that {Xm = 1} and {Xm = 0} indicate the occurrence 
and non-occurrence of mutation m, respectively. In the HIV application, [M] represents the 
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set of seven amino acid changes in the HIV RT described in Section 2 that confer resistance 
to efavirenz, and X„i indicates the fixation of mutation m into the virus population. For 
convenience, we additionally introduce a random variable Xq with Prob(Xo = 1) = 1. In 
this setup, the mutational pattern representing the evolutionary state of the virus population 
corresponds to the binary vector X = {Xi, . . . ,Xm) that takes values in the state space 
I = {0,1}^- 

A mutagenetic tree T on M events is a connected branching on the set of vertices V = 
{0} U [M], rooted at node (Beerenwinkel et al., 2005b); in the cancer literature, such a tree 
has been termed oncogenetic (Desper et al., 1999; von Heydebreck et al., 2004). In particular, 
this tree is an acyclic directed graph and thus induces a directed graphical model for the 
joint distribution of the random vector X (Lauritzen, 1996). The joint distributions for X 
comprised in this graphical model factor as 

Prob(X =x)= Yi Prob(X^ = Xm \ ^pa(m) = a;pa(m))> 
me[M] 

where x e 1, Xq ■= 1, and pa(m) = pa/p(m) denotes the parent of m in T. The mutagenetic 
tree model induced by the tree T is a submodel of the usual graphical model obtained by 
restricting the conditional probability matrices to the form 

1 
/ 1 



1?"" = , 

1 \^ 1 - i9'{[ ^'{{ 

where = Prob(Xm = b \ ^pa(m) = a) G [0,1]. The mutagenetic tree model imposes the 
same conditional independence structure as the graphical model but in addition imposes the 
constraint that an event can occur only if all of its ancestor events have already occurred 
(Beerenwinkel and Drton, 2005, Thm. 14.6). This fact follows from the first row of the 
transition matrix. As a consequence, mutations can only occur in specific pathways. More 
precisely, each mutational pattern corresponds to a subtree of the mutagenetic tree and the 
pathways correspond to sequences of subtrees related by inclusion and increasing by exactly 
one vertex (mutation). The extreme cases are the star and the path. In the star, each vertex 
has the root vertex as parent and all mutational pathways are possible. In the path, the 
vertices form a chain starting with the root vertex. In this case, there is only one pathway. 

A mutational pattern x € I is called compatible with the tree T, if x is a state that may 
occur with non-zero probability in the mutagenetic tree model induced by T. Hence, x is 
compatible with T, if there exist parameters i? = (i^ii, • • • ,'!?n) € [0, l]'^ such that 

Prob(X = x)= n C . ^ > 0- 

me[M] 

The set C(T) of states that are compatible with the tree T forms a finite distributive lattice 
(C(r),V,A), where V and A denote the component- wise maximum and minimum operator, 
respectively (Beerenwinkel and Drton, 2005, Lem. 14.3), cf. Figure 3(a), (b). The so-called 
Hasse diagram in Figure 3(b) illustrates the above-mentioned pathways in that any such 
pathway corresponds to a path from the (wild-type) state (0, ...,0) G C{T) to the state 
(1, . . . , 1) G C(T) with all mutations present. 

In a timed mutagenetic tree model, mutations occur according to independent Poisson pro- 
cesses. If > denotes the rate of this process on the edge pa(m) m, then the probability 
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Figure 3. (a) Mutagenetic tree on five vertices; and (b) its induced lattice 
of compatible states. 

that mutation m occurs during a time period of length At is 

^Ti = Prob(X^ = 1 I Xp,(„) = 1) = 1 - e-^-^*. 

Mutagenetic trees can be reconstructed from cross-sectional data by solving the maximum 
weight branching problem in the complete graph on the set of vertices V with an efficient 
combinatorial algorithm (Dcspcr ct al., 1999). This algorithm is implemented, for example, 
in the Mtreemix software package (Beerenwinkel et al., 2005c). Here, we use the algorithm for 
obtaining the topology (but not the parameters) of the tree. For this purpose, the longitudinal 
data were treated as cross-sectional. 

3.2. Markov chain model. Assume now that we can observe the viral mutational pattern 
within one patient at more than one time point. More precisely, let Xj^ be a binary random 
variable indicating the occurrence of mutation m at time point tj, j = 1, . . . , J, in the patient's 
virus population. We assume that the evolutionary process starts at time in the wild type 
state, i.e. without any mutation, which is the case for the majority of our data (cf. Section 2.1). 
Thus, the initial population state distribution follows the timed mutagenetic tree model. In 
particular, Xim = for all m € [M] with ti = 0. 

For the temporal evolution of the mutational patterns Xj = {Xji, . . . , Xjm), j = ^, ■ ■ ■ , J, 
we make the assumption that these vectors form a Markov chain observed at fixed time points 
ti < t2 < ■ • • < tj. Furthermore, we assume that a mutation occurs at the time it is observed. 
In the HIV application, observations can only be made if the virus load is above a detectable 
limit resulting in censored observations. However, prior work of Foulkes and DeGruttola 
(2003) suggests that the impact of the censoring is minor (cf. Section 5). 

The development of mutation m at a time point tj with j > 2 which is encoded in the 
random variable Xjm, depends on past mutational events as well as current ancestral mu- 
tational events only through two indicators. The first one is Xj^i^m and indicates whether 
the mutation was already present at the immediately preceding time point tj-i. The second 
one is Xjpa(m) and indicates whether the parent mutation is present at time point tj. These 
dependencies arise from modeling the presence of mutation m at time tj as resulting either 
from its introduction along the edge pa(m) ^ m at time point tj, or from its earlier non- 
reversible introduction into the population and hence its presence at time point tj-i. The 
dependency structure among the variables {Xj^ \ j = 1, . . . , J, m E [M]) can be encoded in 
an acyclic directed graph. For example, the subgraph of the graph shown in Figure 4 induced 
by the unshaded vertices represents the mutagenetic tree Markov chain based on the tree in 
Figure 3(a). 
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The dynamics of this Markov chain model are encoded by the transition matrices 

1 
00 / 1 ° \ 

V 1 y 

whose rows are indexed by pairs (m,pa(m)) in {0,1}^. The asterisks indicate entries that 
need not be specified, because no event m can occur before its parent event pa(m). With 
these matrices we define the mutagenetic tree Markov chain model as the family of joint 
distributions of the form 

J 

Prob(Xj^ = Xjm, j = 1,...,J, me [M]) = H 11 ^j(^"^)i^j-i,m,xj^.(m))w 

j=lme[M] 

where Xjo ■= 1 and to :=0. It follows that 

Prob(Xj^ = Xjm I -^j-l,m = Xj-i^m, -'^jpa(m) = Xjpa{m)) = ^j{^rn){xj-i^rn,Xjps,(m)),Xjm.- 

3.3. Mtree-HMM. We now turn to the case of observed clonal samples, rather than pop- 
ulation states, at different time points. We model these data by assuming that the clones 
are erroneous copies of a hidden mutational pattern that evolves according to a mutagenetic 
tree Markov chain. Hence, Xj^, is now a hidden binary random variable. The observed data 
are instances of binary random variables Yj/^m, k G [Kj] that indicate whether mutation m 
is present in the k-th clone sampled from the virus population at time point tj. Clones are 
conditionally independent given the hidden states (Xj)j=i^...^j. The resulting graphical model 
is referred to as the mutagenetic tree hidden Markov model, or mtree-HMM for short. Its 
graph is shown in Figure 4. 

Let = (e^^, . . . , e\^) G [0, 1]*'' and = {e~[ , . . . , e^) G [0, 1]^ be parameter vectors that 
contain the mutation-specific probabilities of observing a false positive and a false negative, 
respectively. False positives (negatives) are mutations (wild type residues) observed in clones 
derived from a virus population that is in the wild type (mutant) state at that time point. 
The false positive and false negative rates summarize differences from the population state 
that can arise from mutations in the PGR reactions, or from erroneous viral copying of the 
dominating strain. Thus, these parameters quantify the expected genetic diversity of the virus 
population. Conditionally upon the hidden state Xjjni the probabilities of observing mutation 
m in clone k at time point tj are 

1 



The entries of this matrix are the conditional probabilities 



^ {^nn^m)xjm,yjkm ~ Prob(l^-fcm — VjUm \ Xj 



The different clones Yjk = (Yjki, ■ ■ ■ ,Yjkm), k G [Kj], are thus modeled as independent and 
identically distributed. Let 

Y = (Yjkm \j = l,...,J,ke[Kj],m€ [M]) 
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Figure 4. Acyclic directed graph representing an mtree-HMM based on the 
tree from Figure 3(a) with J = 3 time points, two clones at time point ti, one 
clone at t2, and three clones at t^. Clear vertices correspond to hidden random 
variables, shaded vertices to observable variables. 



denote all clonal sequence observations. The mutagenetic tree hidden Markov model (mtree- 
HMM) is the family of joint distributions of Y given by 



Prob(y = y) 



E 

xieC(T) 



E n n 

xjeC{T) melM]j=l 



^3 ^'^'^^ {'^j — l,7Ti)^jpa(7Ti))''^jr 



n ^'(^ 



mi '^mlXjm,yjkn 



where we have summed over the hidden states. The graphical model structure of the mtree- 
HMM is illustrated in Figure 4. 

We remark that on an abstract level mtree-HMMs are related to phylogenetic HMMs (phylo- 
HMMs) (McAuliffe et al., 2004; Siepel and Haussler, 2004). Phylo-HMMs are probabihstic 

models of sequence evolution that combine phylogenetic trees at different sites of aligned 
genomes with an HMM that allows for dependence across sites. Thus, in phylo-HMMs the 
tree models arise through dependence in time, and the HMM represents a dependence in space 
(along the genome). While mtree-HMMs are similarly structured in that tree-based models 
arc combined with HMMs, the roles of space and time arc reversed. The mutagenetic trees 
arise from dependence among genome sites, and the HMM introduces dependence across time. 
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3.4. Parameter estimation. Consider now clonal sequence data for a set of patients indexed 
by [N] = {!,... , A^}. For each patient i G [A^], we have observations at time points tn < 
ti2 < • ■ ■ < tiJ^ . We denote by Xij^ the indicator of occurrence of mutation m in the viral 
population state of patient i at time point Uj. The random variable l^jfem indicates the 
occurrence of mutation m in clone k G [Kij] of patient i at time point Uj. We denote the 
transition matrices (Equation 1) by 6ij{Xm)- For example, 

^y(An.)oi,o = e-^'"(**^-**-^-i). 

We assume patients to be independent and each patient to follow the mtree-HMM based 
on a fixed tree T. Then the resulting model for the observations 

Y = {Yijkv I i e [N], j = l,...,Ji, ke [Kij], m G [M]), 

made at time points {tij \ i G [N], j = 1, . . . , J^}, has the likelihood function 

-^obs(A, = ^ 

xiieC{T) 



where we set Xjom '■= 1 and tio := for all i G [A'] and m G [M]. The outer sums make the like- 
lihood function hard to maximize. In order to solve this optimization problem and to obtain 
maximum likelihood estimates, we apply an Expectation Maximization (EM) algorithm. 

Let {xijm} be values of the hidden variables {Xij^} that are compatible with the unob- 
served mutagenetic tree Markov chain model. Then it must hold that Xijm = 1 whenever 
Xi,j-i,m = 1 and Xijm = whenever Xypa(m) = 0- Let / be the indicator function and set 

Xijkmi^^^) ~ ^{^ijm — 0,i Uijkm — ^}) 

for a, 6 = 0, 1. The log-likelihood function ^hid(A, e~) of the hidden model is the sum over 
all i G [N], m e M, and j = 1, . . . , of the expressions 

- Xiim(O) Xm iUj - t,j_i) + Xijmil) log (l - e^"(*^^-*^'^-i)) 

Xijkm 

(0, 0) log(l - £+ ) + Xijfem(0> 1) log 

kG[Kij] 

+ Xijfcm(l> 0) log + Xijfem(l> 1) log(l " ^m) • 

Maximization of £hid is feasible since it does not involve the sums that are present in Lobs- 

The EM algorithm iteratively alternates between its E-step, in which the expectations of 
the indicators Xijm and x'ijkm computed for fixed parameter estimates, and its M-step, in 
which these expectations are used to maximize the expectation of the log-likelihood function 
£hid and to obtain new parameter estimates. Details of the EM algorithm, including the choice 
of starting solutions and the bootstrap procedure for confidence intervals, are given in the 
Appendix. 



E n n n 

XNj^eC{T) ielN] me[M] j=i 



ijm lyijkm j ' 

kelKii] 
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Figure 5. Mutagenetic tree model for the development of resistance to 
efavirenz in the HIV-1 reverse transcriptase. Vertices are labeled with amino 
acid substitutions. 

4. Results 

4.1. Tree reconstruction. Employing the cross-sectional tree reconstruction method yielded 
the mutagenetic tree on 8 vertices displayed in Figure 5 with parameters A shown in Fig- 
ure 6(a). In this mutagenetic tree model, predominantly mutation 103N initiates the devel- 
opment of efavirenz resistance and is followed by 1001, lOlQ, 1081, or 225H. A parallel but 
strongly retarded pathway involves mutation 190S followed by lOlE. Based on this cross- 
sectional analysis, the expected waiting time for mutation 103N to occur is estimated to 

~ "^-^ weeks of efavirenz combination therapy. By contrast, mutation 190S is estimated 
to occur after 113 weeks. 

The mutagenetic tree model has 51 compatible states. The transition matrix of the corre- 
sponding HMM has 588 nonzero entries out of 51^ = 2601 (cf. Appendix). 

4.2. Inferred population states. Using the fixed tree topology, we applied the EM algo- 
rithm to the longitudinal clonal data. The Viterbi algorithm was used to obtain maximum 
a posteriori estimates of the hidden variables, i.e. of the viral population states, resulting in 
one Viterbi path for each patient. Given the model parameters, the Viterbi path is the most 
probable sequence of states to explain both the progression in time of the virus population 
along the lattice of compatible states and the observed clones at the respective time points. 
For example, the Viterbi path for patient 22 is illustrated in Table 1. We estimated mutation 
103N to enter the virus population of this patient at week 48 and mutation 225H to be in- 
troduced after 70 weeks of therapy. Mutation 103N had already been present in 2 out of 16 
clones before onset of therapy (week 0), but was most likely not established in the population 
at this time, because viral replication was suppressed for 48 more weeks (Bacheler et al., 2000, 
Fig. 1). Likewise, mutation 225H appears first at week 59 in 1 out of 7 clones, 11 weeks before 
its estimated fixation. Conversely, we estimated the introduction of this mutation at week 70 
despite the fact that 4 out of 8 cloned did not yet harbor it at this time. 

4.3. Rates of progression. The parameters A are conditional rates of fixation of mutations 
in the population, associated with the edges of the mutagenetic tree. Estimation of these 
rates from cross-sectional data depends on the assumption of independent observations and 
of an exponentially distributed sampling time (Equation 2 in the Appendix), both of which 
are not met with the present data set (p < 10^^^, Kolmogorov-Smirnov test). Estimates of A 
in the mtree-HMM are shown in Figure 6(a). 

Mutation 103N is by far the earliest mutation to occur with an expected waiting time of 19 
weeks, as compared to 190S with 478 weeks. Subsequent to 103N, mutation 225H is most likely 
to occur next, followed by 1081, lOlQ, and lOOL Mutation lOlE appears shortly after 190S. 
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Figure 6. Bootstrap estimates of the parameters of the mtree-HMM for the 
development of efavirenz resistance based on the tree in Figure 5. (a) Pro- 
gression rates A of mutations, reported on a log scale. Diamonds indicate the 
cross-sectional estimate, (b) False positive rates e"*". (c) False negative rates 
e~ . 



These findings are in accordance with in vitro measurements of efavirenz resistance (Bacheler 
et al., 2001). Engineered single amino acid substitutions result in 36-fold and 97-fold reduced 
efavirenz susceptibility for 103N and 190S, respectively, as compared to the wild type. By 
contrast, all other mutations cause significantly less efavirenz resistance on their own (1.2- to 
24-fold, median 5.6). Remarkably however, all of the double mutants 103N+100I, 103N+101Q, 
103N-hl08I, 103N-f225H as well as the triple mutant 103N-hl08I+225H were measured highly 
resistant to efavirenz (84- to 2, 400- fold). Thus, phenotypic resistance increases along the 
mutagenetic tree, and the ordered accumulation appears to result from a specific phenotypic 
gradient. 

Comparing the cross-sectional estimates of A to their longitudinal counterparts obtained 
from bootstrapping reveals consistent overestimation of progression rates by the cross-sectional 
approach (Figure 6(a)). This effect may be due to the high number of clones that have been 
sequenced prior to therapy start at week 0, which leads to an estimated sampling time of only 
22 weeks in the cross-sectional approach. 

4.4. Clonal variation. The variation of clones is modeled by mutation-specific parameters 
and e~ that denote the probability of a false positive and a false negative mutation, 
respectively. Figures 6(b) and 6(c) show the bootstrap estimates of these parameters. False 
positives occur when a mutation is not yet fixed into the population, but is already present 
on single clones. We found very low false positive rates (mean < 0.02) for all mutations. 

We obtained higher estimates for e~ , the rates of false negatives. False negatives occur 
when a mutation is estimated to be fixed into the virus population, but does not appear in 
subsequent clones. This happens more frequently, because the model assumes non-reversibility 
of mutations in the population. The false negative rate of 103N was by far the lowest, 
presumably due to its strong selective advantage both as a single mutation and in combination 
with others. Mutations 190S and lOlE showed high false negative rates but these estimates 
were also subject to a considerable variance. In the case of 190S and also lOlQ, the high 
false negative rate is not as surprising given that the randomization tests in Section 2.2 
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indicated that the assumption of non-reversibihty of substitution is less warranted for these 
two mutations. 

5. Limitations and conclusions 

RNA viruses provide prominent examples of measurably evolving populations, and precise 
predictions of the outcome of this evolutionary process are a prerequisite for the rational design 
of antiretroviral treatment protocols. Insight in the evolutionary process can be obtained from 
longitudinal genomic data. For analysis of such data, we have presented a mutagenetic tree 
hidden Markov model. We applied this model in an analysis of longitudinal clonal sequence 
data on the evolution of drug resistance in HIV. Our focus was on estimating the order and 
rate of occurrence of seven single amino acid changes that are associated with resistance to 
the drug efavirenz. 

Our model of clonal variation assumes that the clones are possibly erroneous copies of a 
fixed but unobserved population state. Thus, we regard the virus population as consisting of 
a single dominating strain and a cloud of closely related mutants. While this assumption is 
justified by the strong selective pressure exerted by antiretroviral therapy, it does not allow 
for explicit modeling of different lineages following different evolutionary pathways. Such 
parallel developments were very rare in the present data set. For example, out of the 7 viral 
clones derived from patient 126 at week 59, 3 clones displayed the mutational pattern 103N- 
225H, and 5 mutations 190S-101E. The estimated population state was the union of these 
four mutations, but none of the observed clones actually displayed this pattern. In fact, these 
parallel lineages contribute to the elevated false negative rates of mutations lOlE and 190S. 
Possible ways of extending the mtree-HMM to account for this effect include modeling the 
population state as more than one strain, and grouping clones into different pathways. 

Evolution in our model is assumed to result from mutation and selection. Incorporating 
recombination would alter our set-up considerably. However, the mutations we considered lie 
no more than 125 codons, or 375 nucleotides, apart. Given the HIV genome of length 10,000 
base pairs, recombination is unlikely to occur in this small region, but cannot be ruled out 
entirely. 

Finally, we emphasize that throughout we have ignored the interval censoring inherent in 
the data and assumed that mutations occur at the time they are observed. However, ignoring 
the interval censoring is unlikely to have a strong effect on the parameter estimates because of 
the small interval lengths. In a related analysis of data with the same time interval structure, 
Foulkes and DeGruttola (2003) have found no qualitative and little quantitative difference 
when using the midpoint of time intervals as the time at which mutations occur. Nevertheless, 
modeling interval censoring will become more important for larger time intervals, for example 
due to longer periods of suppressed virus replication below detectable limits. 

Acknowledgments 

Niko Beerenwinkel is supported by Deutsche Forschungsgemeinschaft (DFG) under grant 
No. BE 3217/1-1. Mathias Drton acknowledges support from NIH grant R01-HG02362-03 
and NSF grant DMS-0505612. 

References 

L. Bacheler, S. Jeffrey, G. Hanna, R. D'Aquila, L. Wallace, K. Logue, B. Cordova, K. Hertogs, 
B. Larder, R. Buckery, D. Baker, K. Gallagher, H. Scarnati, R. Tritch, and C. Rizzo. 



16 N. BEERENWINKEL AND M. DRTON 

Genotypic correlates of phenotypic resistance to efavirenz in virus isolates from patients 
failing nonnucleoside reverse transcriptase inhibitor therapy. J Virol, 75(ll):4999-5008, 
Jun 2001. URL http : //dx . doi . org/10 . 1128/ JVI . 75 . 11 . 4999-5008 . 2001. 

L. T. Bacheler, E. D. Anton, P. Kudish, D. Baker, J. Bunville, K. Krakowski, L. Boiling, 
M. Aujay, X. V. Wang, D. Ellis, M. F. Becker, A. L. Lasut, H. J. George, D. R. Spalding, 

G. Hollis, and K. Abremski. HTiman immunodeficiency virus type 1 mutations selected 
in patients failing efavirenz combination therapy. Antimicrob Agents Chemother, 44(9): 
2475-2484, Sep 2000. 

N. Beerenwinkel, M. Daumer, T. Sing, J. Rahnenfiihrer, T. Lengauer, J. Selbig, D. Hoffmann, 
and R. Kaiser. Estimating HIV evolutionary pathways and the genetic barrier to drug 
resistance. J Infect Dis, 191(11):1953-1960, Jun 2005a. URL http://dx.doi.org/10. 
1086/430005. 

N. Beerenwinkel and M. Drton. Mutagenetic tree models. In L. Pachter and B. Sturm- 
fels, editors. Algebraic Statistics for Computational Biology, chapter 14, pages 278-290. 
Cambridge University Press, Cambridge, UK, 2005. 

N. Beerenwinkel, N. Eriksson, and B. Sturmfels. Evolution on distributive lattices, submitted, 
2006. URL http://arxiv.org/abs/q-bio/0511039. 

N. Beerenwinkel, J. Rahnenfiihrer, M. Daumer, D. Hoffmann, R. Kaiser, J. Selbig, and 
T. Lengauer. Learning multiple evolutionary pathways from cross-sectional data. J 
Comput Biol, 12(6):584-598, 2005b. URL http://dx.doi.org/10. 1089/cmb. 2005. 
12.584. 

N. Beerenwinkel, J. Rahnenfiihrer, R. Kaiser, D. Hoffmann, J. Selbig, and T. Lengauer. 
Mtreemix: a software package for learning and using mixture models of mutagenetic 
trees. Bioinformatics, 21(9):2106-2107, May 2005c. URL http://dx.doi.org/10. 
1093/bioinformatics/bti274. 

N. Beerenwinkel, T. Sing, T. Lengauer, J. Rahnenfiihrer, K. Roomp, I. Savcnkov, R. Fis- 
cher, D. Hoffmann, J. Selbig, K. Korn, H. Walter, T. Berg, P. Braun, G. Fatkenheuer, 
M. Oette, J. Rockstroh, B. Kupfcr, R. Kaiser, and M. Daumer. Computational methods 
for the design of effective therapies against drug resistant HIV strains. Bioinformatics, 
21(21):3943-3950, Sep 2005d. URL http://dx.doi.org/10.1093/bioinformatics/ 
bti654. 

C. A. Boucher, E. O'Sullivan, J. W. Mulder, C. Ramautarsing, P. KeUam, G. Darby, J. M. 
Lange, J. Goudsmit, and B. A. Larder. Ordered appearance of zidovudine resistance 
mutations during treatment of 18 human immunodeficiency virus-positive subjects. J 

Infect Dis, 165(1):105-110, Jan 1992. 

F. Clavel and A. J. Hance. HIV drug resistance. N Engl J Med, 350(10):1023-1035, Mar 2004. 
URL http : //dx . doi . org/10 . 1056/NEJM2ra025195. 

K. A. Crandall, C. R. Kelsey, H. Imamichi, H. C. Lane, and N. P. Salzman. Parallel evolution 
of drug resistance in HIV: failure of nonsynonynious/synonymous substitution rate ratio 
to detect selection. Mol Biol Evol, 16(3):372-382, Mar 1999. 

R. Desper, F. Jiang, O. P. Kallioniemi, H. Moch, C. H. Papadimitriou, and A. A. Schaffer. 
Inferring tree models for oncogenesis from comparative genome hybridization data. J 
Comput Biol, 6(1):37-51, 1999. 



MUTAGENETIC TREE HMM 



17 



A. J. Drummond, G. K. Nicholls, A. G. Rodrigo, and W. Solomon. Estimating mutation 
parameters, population history and genealogy simultaneously from temporally spaced 
sequence data. Genetics, 161 (3): 1307-1320, Jul 2002. 

A. J. Drummond, O. G. Pybus, A. Rambaut, R. Forsberg, and A. G. Rodrigo. Measurably 
evolving populations. TRENDS Ecol Evol, 18(9):481-488, Sep 2003. 

R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis: Probabilistic 
models of proteins and nucleic acids. Cambridge University Press, Cambridge, UK, 
1998. 

A. Foulkes and V. DeGruttola. Characterizing the progression of viral mutations over time. 
J Am Stat Assoc, 98(464) :859-867, 2003. 

M. J. Gonzales, T. D. Wu, J. Taylor, I. Belitskaya, R. Kantor, D. Israelski, S. Chou, A. R. 
Zolopa, W. J. Fessel, and R. W. Shafer. Extended spectrum of HIV-1 reverse transcrip- 
tase mutations in patients receiving multiple nucleoside analog inhibitors. AIDS, 17(6): 
791-799, Apr 2003. URL http : //dx . doi . org/10 . 1097/01 . aids . 0000050860 . 71999 . 
23. 

I. Hallgrimsdottir, R. Milowski, and J. Yu. The EM algorithm for hidden Markov models. 

In L. Pachter and B. Sturmfels, editors, Algebraic Statistics for Computational Biology, 
chapter 12, pages 248-261. Cambridge University Press, Cambridge, UK, 2005. 

S. Lauritzen. Graphical Models. Clarendon Press, 1996. 

J. D. McAuliffe, L. Pachter, and M. I. Jordan. Multiple-sequence functional annotation and 
the generalized hidden Markov phylogeny. Bioinformatics, 20(12):1850-1860, Aug 2004. 
URL http : //dx . doi . org/10 . 1093/bioinf ormatics/bthlSS. 

Y. Michalakis and D. Roze. Evolution. Epistasis in RNA viruses. Science, 306(5701):1492- 
1493, Nov 2004. URL http://dx.doi.org/10.1126/science.1106677. 

A. Molla, M. Korncycva, Q. Gao, S. Vasavanonda, P. J. Schipper, H. M. Mo, M. Markowitz, 
T. Chernyavskiy, P. Niu, N. Lyons, A. Hsu, G. R. Granneman, D. D. Ho, C. A. Boucher, 
J. M. Leonard, D. W. Norbeck, and D. J. Kempf. Ordered accumulation of mutations 
in HIV protease confers resistance to ritonavir. Nat Med, 2(7):760-766, Jul 1996. 

J. Nelder and R. Mead. A simplex method for function minimization. Computer Journal, 7: 
308-313, 1965. 

J. Rahnenfiihrer, N. Beerenwinkel, W. A. Schulz, C. Hartmann, A. von Deimling, B. Wullich, 
and T. Lengauer. Estimating cancer survival and clinical outcome based on genetic 
tumor progression scores. Bioinformatics, 21(10):2438— 2446, May 2005. URL http: 
//dx . doi . org/10 . 1093/bioinf oriiiatics/bti312. 

S.-Y. Rhee, M. J. Gonzales, R. Kantor, B. J. Belts, J. Ravela, and R. W. Shafer. Human 
immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic 
Acids Res, 31(l):298-303, Jan 2003. 

T.-K. Seo, J. L. Thorne, M. Hasegawa, and H. Kishino. Estimation of effective population 
size of HIV-1 within a host: a pseudomaximum-likelihood approach. Genetics, 160(4): 
1283-1293, Apr 2002. 

R. W. Shafer and J. M. Schapiro. Drug resistance and antiretroviral drug development. J 
Antimicrob Chemother, 55(6):817-820, Jun 2005. URL http: //dx. doi . org/10 . 1093/ 



18 



N. BEERENWINKEL AND M. DRTON 



jac/dkil27. 

A. Siepel and D. Haussler. Combining phylogcnctic and hidden Markov models in biosequence 
analysis. J Comput Biol, ll(2-3):413-428, 2004. URL http: //dx.doi . org/10 . 1089/ 
1066527041410472. 

T. Sing, V. Svicher, N. Beerenwinkel, F. Cecclierini-Silberstein, M. D. R. Kaiser, H. Walter, 
K. Korn, D. Hoffmann, M. Oette, J. K. Rockstroh, G. Fatkenheuer, C.-F. Perno, and 
T. Lengauer. Characterization of novel HIV drug resistance mutations using clustering, 
multidimensional scaling, and SVM-based feature ranking. In 9th European Conference 
on Principles and Practice of Knowledge Discovery in Databases, 2005. 

A. von Heydebreck, B. Gunawan, and L. Fiizesi. Maximum likelihood estimation of onco- 
genetic tree models. Biostatistics, 5{4):M5-~556, 2004. URL http: //biostatistics. 
oup journals . org/ cgi/ content/abstract /5/4/545?etoc. 

S. Williamson. Adaptation in the env gene of HIV-1 and evolutionary theories of disease 
progression. Mol Biol Evol, 20(8):1318-1325, Aug 2003. URL http://dx.doi.org/10. 
1093/molbev/msgl44. 



Appendix A. Expectation maximization algorithm 

We provide here a detailed description of the EM algorithm from Section 3.4 that is used 
to estimate the parameters of the mtree-HMM. 

A.l. M-step. The log-likelihood function l-^^^ of the hidden model decomposes into a sum 
involving only A and another sum involving only and £~ . Let 

'i^ijm,a ~ Prob(xjj_i^y7j ~ Oj •^i_;pa(m) ~ -'-) •^ijm ~ ^ | ^)) 

be the conditional expectation of Xijm{a) given the clonal observations Y. For maximum 
likelihood estimation of A, we have to maximize the function 

ie[N]me[M]3='i^ 

This task is complicated by the log terms, but can be accomplished numerically by simple 
general optimization methods, such as the simplex method of Nelder and Mead (1965). 
For estimating £+ and e~ let 

Ji 

u'm,ab = = yijkm = b \ Y) 

ie[N] j=i ke[Kij] 

be the conditional expectation of the corresponding sum of x'ijkmi^^ b) given the clonal obser- 
vations Y. Then the maximum likelihood estimates are 
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A. 2. E-step. The mtree-HMM is a submodel of a standard HMM with hidden state space 
equal to the set C{T) of states compatible with the mutagenetic tree T. Therefore, we can 
compute the conditional expectations Uij^^a and u'm ab applying, separately for each patient, 
the forward- backward equations of the standard HMM (Durbin et al., 1998; Hallgrimsdottir 
et al., 2005). In the standard HMM only certain transitions between compatible states can 
occur with non-zero probability, namely those that respect the partial order of the lattice 



induced by T. The probability Prob(Xjj 
to state X G C{T) is given by 



X 



of transitioning from state x G C(T) 



me[M] 



PT0b{W=x) 



^m,2;pa(m))i^" 



if X ^ X, 

else, 



where C = "^x'^x Prob(VF = x') and W 2 is a random vector distributed according to the 
timed mutagenetic tree model based on the tree T and the time interval At = tij —tij-i. The 
number of compatible states depends on the topology of T but is bounded by M+1 < |C(r)| < 
2^ . Thus the number of non-zero transition probabilities also depends on the topology of T. 
For example, the transition matrix arising from the tree T shown in Figure 3(a) is of the form 
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where asterisks indicate entries that arc not restricted to zero. 

The remaining structure of the standard HMM is determined by the probability of emitting 
clones Hiji, . . . , VijKij from a hidden state x G C{T), which is equal to 



n n 



TTl/Xm^yijkv 



Applying the forward-backward equations to this HMM yields, for each patient i and time 
point tij, the conditional expectation of the number of transitions Uij^xx from state x G C{T) 
to state X G C{T), and the conditional expectation of the number of emissions U^j of clone 
y El from state x G C{T). From these conditional expectations, we can obtain the quantities 



u. 



ijm,a 



E 



IJ,XX1 



{{x,x)eC{T)xC{,T) : 



^m,af> ~ X/ X/ X/ X/ 

iG[iV] mG[M] i=l {(^,H)eC(T) xl : 
xm=at Vm=b} 



U'- 



that are needed in the M-step of the EM algorithm for the mtree-HMM. 

If several vertices of the mutagenetic tree share the root vertex as parent, additional 
computational saving is possible by employing that variables in different subtrees rooted at 
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are mutually independent. In this case several standard HMMs with smaller state spaces can 
replace the one standard HMM. 

A. 3. Starting solutions. As initial parameter values in the mtree-HMM we used 0.1 for 
all error probabilities and e~ . The initial rates A were derived from the cross-sectional 
approach by assuming an exponential distribution with rate A^^ for the sampling times. Under 
this assumption A can be computed from 

(2) ^- = ^^, m^m, 

where both t?^ and At were estimated directly from the cross-sectional data (Rahnenfiihrer 
et al., 2005). 

A.4. Confidence intervals. We obtained confidence intervals for all model parameters A, 
£"*", and e~ from 100 bootstrap runs of the EM algorithm. Resampling of the data was 
performed in two steps. First, a set of 163 patients was drawn at random. Second, for each 
patient and each time point, the corresponding set of clones was used for resampling the 
respective number of clones. Thus, the number of patients, the structure of sampling times, 
and the number of clones equaled those of the original data set in all bootstrap runs. 
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